The Implementation of Hadoop-based Crawler System and Graphlite-based PageRank-Calculation In Search Engine

Authors

  • Qingpei Guo
  • Chao Xu
  • Yang Song
Abstract

Nowadays, the size of the Internet is growing rapidly. As of December 2014, the number of websites worldwide had exceeded 1 billion, and all kinds of information resources are aggregated on the Internet; the search engine has therefore become a necessary tool for users to retrieve useful information from these vast amounts of web data. Generally speaking, a complete search engine comprises a crawler system, an index-building system, a sorting (ranking) system, and a retrieval system. At present there are many open-source search-engine implementations, such as Lucene, Solr, Katta, Elasticsearch, Solandra, and so on. The crawler system and the sorting system are indispensable for any kind of search engine, and to guarantee efficiency, the former must continuously update the vast amounts of crawled data while the latter must build indexes over newly crawled web pages in real time and calculate their corresponding PageRank values. It is impractical to accomplish such huge computation tasks with a single-machine implementation of the crawler and sorting systems, which is why distributed cluster technology comes to the fore. In this paper, we use the Hadoop MapReduce computing framework to implement a distributed crawler system, and use GraphLite, a distributed synchronous graph-computing framework, to compute in real time the PageRank values of newly crawled web pages.
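The synchronous graph computation the abstract refers to proceeds in supersteps: in each superstep every vertex distributes its current rank along its out-links, then combines the incoming shares with a damping term. The following is a minimal single-process sketch of that iteration, not the paper's GraphLite code; the toy graph, damping factor, and iteration count are illustrative assumptions.

```python
# Single-process sketch of the synchronous (Pregel-style) PageRank iteration
# that a framework such as GraphLite distributes across workers.
# Graph, DAMPING, and ITERATIONS are illustrative assumptions.

DAMPING = 0.85
ITERATIONS = 30

def pagerank(graph):
    """graph: dict mapping each vertex to the list of its out-neighbours."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(ITERATIONS):
        # Superstep: every vertex sends rank/out_degree to each out-neighbour.
        incoming = {v: 0.0 for v in graph}
        for v, out in graph.items():
            if out:
                share = rank[v] / len(out)
                for w in out:
                    incoming[w] += share
        # Each vertex combines the received shares with the damping term.
        rank = {v: (1 - DAMPING) / n + DAMPING * incoming[v] for v in graph}
    return rank

# Toy web graph: page -> pages it links to.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
```

In a distributed setting each worker holds a partition of the vertices, the "send" step becomes message passing between workers, and a barrier separates consecutive supersteps; the per-vertex arithmetic is the same as above.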

Similar Articles

Review of ranked-based and unranked-based metrics for determining the effectiveness of search engines

Purpose: Traditionally, many metrics have been used to evaluate search engines; nevertheless, various researchers have proposed new metrics in recent years. Awareness of these new metrics is essential for conducting research in the field of search-engine evaluation. The purpose of this study was therefore to provide an analysis of important new metrics for evaluating search engines. Methodology: This is ...


Towards Supporting Exploratory Search over the Arabic Web Content: The Case of ArabXplore

Due to the huge amount of data published on the Web, the Web search process has become more difficult, and it is sometimes hard to get the expected results, especially when the users are less certain about their information needs. Several efforts have been proposed to support exploratory search on the web by using query expansion, faceted search, or supplementary information extracted from exte...


A Distributed P2P Link Analysis Based Ranking System

Link-based approaches are among the most popular ranking approaches employed by search engines. They make use of the inherent link structure of World Wide Web documents, assigning each document an importance score. This importance score is based on the incoming links of a document; a document which is pointed to by many high-quality documents should have a higher importance score. Goog...


Using SiteRank for Decentralized Computation of Web Document Ranking

The PageRank algorithm demonstrates the significance of computing document rankings of general importance or authority in Web information retrieval. However, doing a PageRank computation for the whole Web graph is both time-consuming and costly. State-of-the-art crawler-based search engines also suffer from the latency in retrieving a complete Web graph for the computation of PageRa...


RankMass Crawler: A Crawler with High Personalized PageRank Coverage Guarantee

Crawling algorithms have been the subject of extensive research and optimizations, but some important questions remain open. In particular, given the unbounded number of pages available on the Web, search-engine operators constantly struggle with the following vexing questions: When can I stop downloading the Web? How many pages should I download to cover “most” of the Web? How can I know I am ...


Journal:
  • CoRR

Volume abs/1506.00130  Issue 

Pages  -

Publication date 2015